fix: Avoid unnecessary type casts in `concat_ws` by neilconway · Pull Request #20436 · apache/datafusion

neilconway · 2026-02-19T16:03:22Z

Which issue does this PR close?

Closes #20434 (concat_ws does unnecessary type casts).

Rationale for this change

concat_ws returned Utf8, regardless of the input types it was called with. If it was called with LargeUtf8, returning Utf8 might overflow Utf8's 32-bit offsets. In general, functions like these should operate on all three string representations, as concat already does, unless there is a compelling reason not to.
simplify_concat_ws always constructed new literals with type Utf8. This led to unnecessary casts when its inputs were of a different string type.

What changes are included in this PR?

Support concat_ws return type matching its input types, following how concat does it.
In simplify_concat_ws, construct literals with the right type, not always Utf8
Refactor return_type for concat to be more readable
Make StringViewArrayBuilder API more similar to the other string array builders, WRT null handling
Add new unit and SLT tests
Update test output for changed types

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes: some queries involving concat_ws will now omit unnecessary cast operations, and the return type of concat_ws might be any of the three string types. Generally these changes should match user expectations better than the previous behavior.

Omega359 · 2026-02-20T17:11:54Z

I did a quick look at the changes and nothing obvious jumped out at me. I'll try and find time to do a more extensive review if no one else beats me to it.

neilconway · 2026-02-20T17:13:40Z

@Omega359 Thank you!

Omega359 · 2026-02-21T14:36:24Z

🤖 /home/bruce/gh_compare_branch_bench.sh Benchmark Script Running
Linux fedora 6.18.12-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 16 18:58:26 UTC 2026 x86_64 GNU/Linux
Comparing neilc/concat-ws-type-fixes (9c0b4f4) to c3f0807 diff
BENCH_NAME=concat_ws
BENCH_COMMAND=cargo bench --bench concat_ws
BENCH_FILTER=
BENCH_BRANCH_NAME=neilc_concat-ws-type-fixes
Results will be posted here when complete

Omega359 · 2026-02-21T14:41:07Z

🤖: Benchmark completed

Details

group                                  main                                   neilc_concat-ws-type-fixes
-----                                  ----                                   --------------------------
concat_ws function/concat_ws/1024      1.00     11.1±0.08µs        ? ?/sec    1.16     12.9±2.43µs        ? ?/sec
concat_ws function/concat_ws/4096      1.00     44.1±0.88µs        ? ?/sec    1.02     44.8±1.95µs        ? ?/sec
concat_ws function/concat_ws/8192      1.00     88.5±3.98µs        ? ?/sec    1.02     90.4±3.22µs        ? ?/sec
concat_ws function/concat_ws/scalar    1.00     28.3±0.15µs        ? ?/sec    1.02     28.8±0.41µs        ? ?/sec

Omega359 · 2026-02-21T14:59:47Z

datafusion/functions/src/string/concat_ws.rs

-                builder.append_offset();
-                continue;
+        match return_datatype {
+            DataType::Utf8View => {


I wonder if all this duplicated code could be eliminated with an approach similar to

datafusion/datafusion/functions-nested/src/string.rs

Line 792 in 2a08013

trait StringArrayBuilderType: ArrayBuilder {

?

Yeah, I think that would make sense to do. I'm inclined to do it as a follow-up PR -- let me know if you'd prefer it as part of this PR.

I think that is fine.
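
A rough sketch of the kind of deduplication being discussed, assuming a small builder trait so the row loop is written once and reused for each output type. Everything here is hypothetical: `StringBuilderLike`, `OffsetBuilder`, `ViewBuilder`, and `concat_ws_rows` are toy stand-ins, not the DataFusion or arrow-rs types, and the trait is not the `StringArrayBuilderType` referenced above. The semantics assumed are the usual SQL ones: a NULL separator yields NULL, and NULL arguments are skipped.

```rust
/// Hypothetical minimal builder interface (not the actual DataFusion trait).
trait StringBuilderLike {
    type Output;
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
    fn finish(self) -> Self::Output;
}

/// Toy stand-in for an offset-based builder (Utf8 / LargeUtf8 style):
/// one contiguous data buffer plus an end offset per row (None = NULL).
#[derive(Default)]
struct OffsetBuilder {
    data: String,
    ends: Vec<Option<usize>>,
}

impl StringBuilderLike for OffsetBuilder {
    type Output = Vec<Option<String>>;
    fn append_value(&mut self, value: &str) {
        self.data.push_str(value);
        self.ends.push(Some(self.data.len()));
    }
    fn append_null(&mut self) {
        self.ends.push(None);
    }
    fn finish(self) -> Self::Output {
        let mut start = 0;
        let mut out = Vec::with_capacity(self.ends.len());
        for end in &self.ends {
            match *end {
                Some(e) => {
                    out.push(Some(self.data[start..e].to_string()));
                    start = e;
                }
                None => out.push(None),
            }
        }
        out
    }
}

/// Toy stand-in for a view-style builder: each row is stored separately.
#[derive(Default)]
struct ViewBuilder {
    rows: Vec<Option<String>>,
}

impl StringBuilderLike for ViewBuilder {
    type Output = Vec<Option<String>>;
    fn append_value(&mut self, value: &str) {
        self.rows.push(Some(value.to_string()));
    }
    fn append_null(&mut self) {
        self.rows.push(None);
    }
    fn finish(self) -> Self::Output {
        self.rows
    }
}

/// The row loop, written once and generic over the builder.
fn concat_ws_rows<B: StringBuilderLike>(
    mut builder: B,
    sep: Option<&str>,
    rows: &[Vec<Option<&str>>],
) -> B::Output {
    for row in rows {
        match sep {
            // A NULL separator makes the whole result NULL.
            None => builder.append_null(),
            Some(sep) => {
                // NULL arguments are skipped rather than propagated.
                let parts: Vec<&str> = row.iter().flatten().copied().collect();
                builder.append_value(&parts.join(sep));
            }
        }
    }
    builder.finish()
}
```

With this shape, the `match return_datatype` would only need to pick a builder and call the shared function, instead of repeating the loop in each arm.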

Omega359 · 2026-02-24T12:48:45Z

LGTM. @Jefffrey, @alamb I believe this is ready for final review and approval.

Omega359 · 2026-02-25T14:32:34Z

datafusion/functions/src/string/concat.rs

-        Ok(dt.to_owned())
+        if arg_types.contains(&Utf8View) {
+            Ok(Utf8View)
+        } else if arg_types.contains(&LargeUtf8) {


I had a thought about this. I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32) https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.LargeUtf8

Interesting point. I believe the typical precedence today is Utf8View > LargeUtf8 > Utf8, partly on the grounds that "StringArray to StringViewArray is cheap but not vice versa". I can see arguments for both sides; if we want to reconsider this, seems like a distinct issue?

datafusion/datafusion/expr-common/src/type_coercion/binary.rs

Lines 1663 to 1667 in e894a03

/// Coercion rules for string view types (Utf8/LargeUtf8/Utf8View):
/// If at least one argument is a string view, we coerce to string view
/// based on the observation that StringArray to StringViewArray is cheap but not vice versa.
///
/// Between Utf8 and LargeUtf8, we coerce to LargeUtf8.
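
To make the precedence under discussion concrete, here is a minimal sketch of the Utf8View > LargeUtf8 > Utf8 rule. It uses a local `StringType` enum rather than Arrow's `DataType` so that it stands alone; the enum and the function name `coerced_string_type` are illustrative assumptions, and the final `Utf8` fallback is inferred from the stated ordering rather than copied from the diff.

```rust
/// Local stand-in for the three Arrow string types (hypothetical enum,
/// used here so the sketch has no external dependencies).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick a return type from the argument types using the precedence
/// Utf8View > LargeUtf8 > Utf8.
fn coerced_string_type(arg_types: &[StringType]) -> StringType {
    use StringType::*;
    if arg_types.contains(&Utf8View) {
        Utf8View
    } else if arg_types.contains(&LargeUtf8) {
        LargeUtf8
    } else {
        Utf8
    }
}
```

Under the alternative ordering proposed in this thread, the first two branches would swap, so a mix of `LargeUtf8` and `Utf8View` arguments would yield `LargeUtf8`.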

Likely it should be another issue as it likely occurs in a few places. I am fairly certain I am correct on the proper type ordering here but in the wild I doubt it would be encountered much - just how many columns would have > 2 billion bytes?

Also, this is really only pertinent in areas that pick a return type based on multiple columns. For the typical case where the udf is operating on a single column the existing logic should be fine - such as in btrim

I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)

I think the only type of data that can't be stored in a Utf8View that a LargeUtf8 can handle is individual strings that are longer than 2GB

Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one)
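
The distinction can be sketched as a size check, taking the limits as described in this thread: a Utf8 array is bounded by the total bytes its 32-bit offsets can address, whereas a Utf8View array is bounded only per string, because its data can be spread across multiple buffers. The helper names and the exact 2GB constant (`i32::MAX`) are assumptions for illustration, not values taken from the Arrow implementation.

```rust
/// Assumed limit of roughly 2GB, per the discussion above.
const LIMIT_32_BIT: u64 = i32::MAX as u64;

/// Utf8: all strings share one data buffer indexed by 32-bit offsets,
/// so the total number of bytes must fit.
fn fits_utf8(lengths: &[u64]) -> bool {
    lengths
        .iter()
        .try_fold(0u64, |acc, &n| acc.checked_add(n))
        .map_or(false, |total| total <= LIMIT_32_BIT)
}

/// Utf8View: data may span multiple buffers, so only each individual
/// string length must fit.
fn fits_utf8_view(lengths: &[u64]) -> bool {
    lengths.iter().all(|&n| n <= LIMIT_32_BIT)
}
```

For example, three 1 GiB strings exceed the Utf8 limit in total but pass the per-string Utf8View check, while a single 3 GiB string fails both and would need LargeUtf8.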

Indeed, that was the point I was trying to get across. It's rare ... but possible. Though honestly I expect DF would fail somewhere else pretty quickly if a column with data that big was ever encountered.

neilconway · 2026-02-27T13:46:02Z

@alamb This is ready to be reviewed and/or merged, I think.

Initial work

b976e16

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 19, 2026

neilconway added 2 commits February 19, 2026 11:22

Fix bug with NULL separators, add tests

51a7e87

cargo fmt

9c0b4f4

Omega359 reviewed Feb 21, 2026

Omega359 reviewed Feb 25, 2026

Tim-53 mentioned this pull request Mar 1, 2026

Update string UDF's to have return type == input type where appropriate #20585

Open
